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Abstract 

In this paper we discuss some computational problems associated to matched filter- 
ing of experimental signals from gravitational wave interferometric detectors in a 
parallel-processing environment. We then specialize our discussion to the use of the 
APEmille and apeNEXT processors for this task. Finally, we accurately estimate 
the performance of an APEmille system on a computational load appropriate for 
the LIGO and VIRGO experiments, and extrapolate our results to apeNEXT. 
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1 Introduction 



Several earth-based interferometric experiments for the detection of gravita- 
tional waves (GW) are currently under development, and expected to reach 
the data-taking stage in the near future. On a longer time scale, space-based 
experiments are foreseen [1]. These experiments will search, among other, for 
GW generated by inspiralling compact binary-star systems. 

The expected functional form of the signal produced by a coalescing system 
is known to good approximation [2] , so matched filtering is an effective strat- 
egy to extract GW signals from the noise background. Matched filtering is 
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basically obtained by projecting the experimental output (signal plus noise) 
onto the expected theoretical signal, and is best done in Fourier space, us- 
ing Fast Fourier Transform (FFT) techniques (see later for more details). 
The functional form of the expected signal depends however on the physical 
parameters (e.g., masses, angular momenta, eccentricity) of the inspiralling 
system. It is necessary to match the experimental oTitput to a set of expected 
signals (so-called templates) corresponding to points in the parameter space 
that cover the physical region of interest and are close enough (under some 
appropriate metric) to ensure sufficient overlap with any expected GW event. 

The number of needed templates for e.g. the VIRGO experiment is of or- 
der of 10"'' ■ ■ ■ 10®, so the corresponding computational cost is huge by current 
standards. A nice requirement is the possibility of real-time analysis of the 
experimental data, which means that the available computational power is 
enough to process experimental data at the rate at which they are produced, 
so a prompt "trigger" of a GW event is possible [3] . 

Matched filtering to a (large) set of templates is an obvious candidate for par- 
allel processing of the simplest form, e.g., data farming with all elements of 
the farm performing the same computation (Single Program Multiple Data 
(SPMD) processing). Indeed, the experimental data stream is sent to all pro- 
cessors in the farm, each element performing the matching procedures for a 
subset of the physical templates. 

Massively parallel specialized SPMD architectures, with peak processing power 
of the order of 1 Tfiops have been developed by several groups to fulfill the 
computational requirements of Lattice Gauge Theories (LGT) [4]. In this pa- 
per we want to analyze the performance of one such system (the APEmille 
system [5]) for matched filtering of GW signals. 

This paper is not a proposal to use APE-like systems in an actual experimental 
(the relative merits of different computer systems in a large experiment have 
so many facets that they can only be assessed by those directly working on 
it). Rather, the potential usefulness of our work lies in the following: given 
the fast pace of development in the computer industry, an experiment will 
try to delay the commissioning of a production system to as late a point 
in time as possible, since huge gains in price and/or price/performance can 
be expected. This means that very large computing capabilities will not be 
available for much needed early tests and simulations. APE systems might 
provide an answer to this problem. 

The focus of this paper is the measurement of the performance of APE systems 
for matched filtering. Some parts of the paper have however a more general 
scope and refer to general parallelization criteria for the problem at hand. 

This paper is structured as follows: Section 2 briefiy reviews the formahsm 
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of matched filtering. Section 3 evaluates the associated computational cost in 
general terms and discusses some strategies to minimize this quantity. Section 
4 discusses the features of the APE systems relevant for the problem, while 
section 5 presents a procedure for allocation of templates to processors suitable 
for APE and general enough to adapt to other computer systems. Section 6 
presents the result of actual performance measurements made on APE, while 
section 7 contains our concluding remarks. 



2 Formalism 

In this section we briefiy summarize the mathematical formalism recently de- 
veloped to analyze matched filtering of GW signals from coalescing binaries. 
We closely follow the notation presented in [8]. 

We call h{t) the interferometer output, which is the sum of the signal s{t) and 
the noise n{t), while u{t) is a template. n{t) is characterized by its one-sided 
spectral density: 



where E[. . . ] means ensemble expectation value, tilde (~) stands for Fourier 
transformed functions and asterisk (*) for complex conjugation. 

For the sake of dcfiniteness, we consider in the following templates computed 
to second post-Newtonian expansion. They depend, in principle, on several 
parameters: the coalescing phase 0c and coalescing time tc, and the parame- 
ters corresponding to the physical characteristics of the system, called intrinsic 
parameters and globally referred to by the vector 0. A template is precisely 
identified by u(t;6,(f)c,tc). It is believed that the most relevant intrinsic pa- 
rameters are the masses of the binary systems, so as a first approximation it is 
usual to neglect all intrinsic parameters except masses. In this approximation, 
is a vector of two components. 

In a matched filter the signal to noise ratio (SNR) is usually defined by 




(1) 



S _ {h, u) 



(2) 



N rms {n,u) 



where (...) is a particular inner product defined as: 




(3) 
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It can be shown that rms{n,u) = {{u,u)y/'^, so (2) simphfy to S/N — {h,u), 
if normahzed templates are used [6]. 



Filtering a signal means to look for local maxima of the signal to noise ratio, in 
terms of its continuous parameters. The maximization over the phase 0c can 
be done analytically ( it can be seen that the maximum value is obtained com- 
puting two inner product as in (3) on two real templates with opposite phases 
and then summing their square values [10]). Maximization over tc instead is 
achieved at low computational cost calculating the cross correlations by the 
FFT algorithm. Maximizations over the intrinsic parameters are not possible 
analytically. For this reason the normal procedure consists in a discretizations 
of templates in the space of the intrinsic parameters. 

The obvious question concerns the number of templates needed to cover the 
whole parameter space. A differential geometrical approach has been devel- 
oped recently [7]. One introduces a new function, the match M{6i, 02), which 
is the product of two templates with different intrinsic parameters, where a 
maximization is assumed over tc and 4>c- 



M(6>i, 6>2) = max (k(6>i, 0^ + A0e, t^ + AiJ, ii(6>2, 4>c, Q) (4) 

A(pc,Atc 



The match between two templates with near equal parameters may be Taylor 
expanded 



k—Qk 



suggesting the definition of a metric 



_ 1 f d'M(e,<^) \ 



k—Qk 



M(6>, 6> + A6>) ~ 1 - Qij A9'A9^ (7) 



In the limit of close template spacing we have an analytical function able to 

measure the distance between templates in the intrinsic parameter space. The 
metric gij{0) depends on the intrinsic parameters so the real volume covered 
by a template varies locally. This effect can be reduced writing the templates 
in terms of some new variables for which the metric is more regular. One 
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suitable choice is the following: 



M-2/3 



(8) 



MV3 



(9) 



where M is the total mass of the binary system, the reduced mass, and /o 
an arbitrary frequency. This change of variables makes the metric tensor com- 
ponents constant at the first post-Newtonian order, so only small 0-dependent 
contributions are present at the second order approximation. 

It is now possible to simply estimate the total number of templates necessary to 
recover the signal at a given level of accuracy. We calculate the volume covered 
by a single template in the parameter space in term of a minimal value for the 
match, the so called minimal match MM which states a minimal requirement 
on signal recovering capabilities. For example, if we simply use a face centered 
hyper-cubic lattice, we can write the maximum covering volume with: 



where D is the dimension of the parameter space (2 in our example). 

An approximate estimation of the total template number, applicable when N 
is very large [8], is given taking the ratio between the total volume of the 
physically relevant parameter space and the volume covered by a template 
placed in the center of a lattice tile 



Using (11) we estimate that in the range from 0.25 to 10. solar masses the 
total template number is roughly 3.8 ■ lO'' for LIGO and 1.1 • 10^ for VIRGO 
( see Section 5 for the additional assumptions involved in this calculation ). 

A last remark we want to make is that the minimal match requirement also 
determines a threshold value for the signal ( and templates ) sampling fre- 
quency. This frequency can be simply estimated and will be take into account 
later on in our computational estimates. 




(10) 



N{MM) 



AV 
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3 General Strategy 



In this section we present some observations about a general strategy to com- 
pute correlations. Here we consider an ideal case in which most computer- 
related issues are neglected. We also limit our treatment only to the stored 
templates strategy, where templates arc prc-calculatcd, then Fourier trans- 
formed and prepared to be processed and finally stored in memory. This ideal 
case is not unrealistic, given the pace at which actual memory sizes increase 
in real computers. The quantity to be evaluated on every template is given by 



C{t) 



/+00 
r(/)-5(/) e-'^^'fUf 
-oo 



where h{f) is the Fourier transform of a complex template h{t) . 



At present the best way of compute C{t) uses a FFT algorithm, reducing the 
number of needed operations from to n logg n. The FFT algorithm assumes 
input periodicity, while in our case signal and templates are not repeated 
data. The usual trick to overcome this problem [9] consists in padding with 
a certain number of zeros the tail of the templates to be processed. Assume 
that the template has ut points. We pad it so its total length become np, and 
then compute the correlation by using the padded template and np signal 
points. The resulting correlations are only valid in their first np — ut points, 
all remaining points being affected by the periodicity assumption implied in 
the FFT technique. We define padding-ratio the quantity Rpad = Up/ut- The 
result obtained in this way covers a time-period of length {up — Ut) / fs, where 
fs is the sampling frequency of the experimental signal. The last ut data- 
points will have to be re-analyzed in a successive analysis. 

The computing power necessary for an on-line analysis of templates of given 
Ut and up (fioating point operations per second) is given by: 



Cp = ^ {A-np log2(np) + B ■ np) (13) 

np — riT 



A and B are constants, usually of the same order, depending on the specific 
algorithm used. In this paper we use a simple-minded FFT algorithm for 
power of two length vectors that involves A = 5 and B = 12 for the whole 
analysis. Although more general and efficient algorithms exist, our choice does 
not influences strongly the following observations and flnal results. 
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One interesting question concerns the optimal padding that minimizes com- 
puting requirements. If one disregards the fact that (13) holds only for np 
values that are powers of 2, the answer is given by fig.l, where the minimum 
in Up of eq. 14 is plotted as a function of n^, for = IkHz. The behavior 
is very close to a logarithmic function in ut, so computing costs depend very 
weakly on ut- This result is obtained for an optimal choice of np, as discussed 
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Fig. 1. Estimate of the computing power (floating point operations per second 
Flops), versus template length ut for an optimal choice of np. We set A = 5 e 
B = 12 and fs = IkHz (see Eq. 13) 

above. As shown in fig.2, the optimal value for np grows with Ut, implying in 
principle very large memory requests. In practice however (see again fig.2) a 
value of np/riT — 2 ••• 4 is very close to the optimal case for reasonable values 
of Ut- This finally means that deviations from the optimal padding length do 
not produce drastic consequences on the computing power needed to perform 
the analysis, and that np can be easily adjusted to a suitable power of two. 



4 Analysis on APEmille 



The APE family of massively parallel processor has been developed in or- 
der to satisfy the number crunching requirements of Lattice Gauge Theories 
(LGT)[5]. Machines of the present APE generation (APEmille) are installed 
at several sites, delivering an overall peak processing power of about 2 Tfiops. 
The largest sites have typically 1000 processing nodes (i.e., 520 Gfiops) [12]. 
Sustained performance on production-grade LGT codes is about 45 % of peak 
performance. A new APE generation (APEnext) is under development, and 
expected to reach the physics-production stage in early 2004. O{10T flops) 



7 



500 



T 1 1 — I I I I I 



T 1 — I I I I I 



T — I I I I 



400 - 




300 - 



200 - 



100 - 




1 



10 



100 



1000 




Fig. 2. Computing power in Flops versus the padding ratio np/nx, for three typical 
template lengths. We use the same parameters as in the previous figure. 

peak performance installations are being considered. 

APEmille systems are based on a building block containing 8 processing nodes 
(processor and memory) running in Single Instruction Multiple Data (SIMD) 
mode. Each processor is optimized for floating point arithmetics and has a 
peak performance of 500 MFlops in IEEE single precision mode. The proces- 
sors are logically assembled as the sites of a 2 x 2 x 2 mesh, with data links 
connecting the edges. This arrangement is called a "cluster" or a "Cube" . 

Large APEmille systems arc based on a larger 3-dimcnsional mesh of pro- 
cessor, based on replicas of the above-described building block. The resulting 
mesh has a full set of first neighbor communication links. In a typical LGT 
application the whole system works in lock-step mode as a single SIMD sys- 
tem. More important for the present application, each Cube is able to operate 
independently, running its own program under the control of a Linux-based 
personal-computer acting as a host. There is one host machine every 4 Cubes. 
A set of up to 32 Cubes (i.e., 256 nodes) and the corresponding 8 host ma- 
chines is a fully independent unit housed in a standard-size mechanical enclo- 
sure. Each Cube has access to networked disks with a bandwidth of about 4 
MByte/sec. In some APEmille installations, disks have been mounted directly 
on the host PCs. In this case, bandwidth increases approximately by a factor 
4. 

The next generation APE system (apeNEXT) is, for the purposes of the 
present discussion, just a faster version of the same architecture. The only 
(welcome) architectural difference is the fact that the basic logical building 
block (capable of independent operation) is now just one processing node. 
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A large APEmillc system can be seen as a large farm of processors, whose 
basic element is a SIMD machine of dimension 8. A better way to look at 
the SIMD cluster in our case follows the paradigm of vector computing: the 
SIMD cluster applies the input signal to a vector of 8 templates and produces 
a vector of 8 correlations. In a variation of the same method, the same tem- 
plate could be present on all nodes of the SIMD cluster, and correlations at 
8 staggered time points could be computed. Since the number of correlations 
is of the order of 10^ • • • 10^, each element of a large farm (say 10^ SIMD 
clusters) takes responsibihty for several hundreds or thousands of templates. 
This is good news, since APE processors can exploit vector processing within 
the node to reach high efficiency (we just recall here for reader interested in 
architectural details that vector processing effectively helps to hide memory 
access latencies). 

We have written an APE code performing all the steps needed for matched 
filtering on a pre-calculated (and pre-FFT transformed) set (vector) of k tem- 
plates each of length n, and measured its performance on an APE cluster. 
An analysis of the details of the APEmille processor suggest to model the 
computation time Tc{n, k) as 



f{n) is related to the complexity of the computation, that we model as Cin • 
log2(n) -|- C2n -\- C3, following Eq.l3 and introducing one more parameter (03) 
covering machine effects. g{k) is a measure of the processor efficiency as a 
function of the vector length k, that we normalize to g{l) = 1. Taking into 
account that the computation is memory-bandwidth limited (as opposed to 
processing-power limited), we adopt the following functional form for g{k): 



Measured and fitted values for f{n) and g{k) are shown in fig. 3 and fig.4 
respectively. 

APEmille efficiencies are smooth functions of np and k. A rather good value 
of ~ 20%, including all computational overheads, is possible when large sets 
of templates {k > 20) are used. 



Tc{n,k) = f{n)-g{k)-k 



(14) 



g{k) = Ci/k + C5 



(15) 
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Fig. 3. Analysis time as a function of template length Tc(n, 1) = /(n). Mea- 
sured data points are fitted to the functional form of 14, with ci = 8.6 • 10~^sec, 
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Fig. 4. Normalized processor efficiency g{k) as a function of the vector length k. 
Measured points are fitted to (15). 



5 Allocation criteria and processor numbers 



A general templates allocation strategy on real computers has to take into 
account the limited size in memory and the available computing power avail- 
able. Here we present some quantitative aspects of memory and CPU usage 



10 



involved in our analysis, then we give our allocation criteria for the optimal 
template number manageable by a single processor. 

This discussion focuses on criteria that are appropriate for the APE family of 
processors. The focus is to exploit vectorization as much as possible and to 
find ways to reduce input-output bandwidth requirements, so our discussion 
can be applied to a larger class of processors. 

We start from memory. Each processor has k stored templates of similar length 
ut- (In the APEmille case, the term processor must be understood to refer to 
the basic cluster of 8 processing element) . 

Vector processing of all the templates requires that they are all padded to the 
same np, so we need k arrays of np complex words, and matching space for 
the final correlation results. 

There are two basic memory allocation strategies: we may assign different 
sets of k templates to each element in a basic 8 processor cluster, and have 
all of them compute the corresponding correlations for the same time stretch 
(np — nT)/fs) so each cluster computes 8A; correlations. Alternatively, we may 
assign the same set of templates to all processing elements and have each 
of them compute correlations for different time intervals. With this choice 
k correlations are computed for a longer time stretch 8(np — riT)/ fs- The 
best choice between these two nearly equivalent cases is based on bandwidth 
constraints. In APEmille, data items reaching the cluster can be delivered to 
just one element, or broadcast to all of them. In the latter case, bandwidth 
is effectively multiplied by a large factor (x8), so there is an advantage if 
large data blocks must be broadcast to the complete cluster. We will use 
quantitatively these observations later on in this section. 

We now consider processing power. The real-time requirement stipulates that 
each processor cluster completes processing all its templates within an elapsed 
time {np — ut)/ fs (or 8(np — ut)/ fs)- As shown later on, for several realistic 
templates sizes, the processing time T{np, k) is much shorter that the elapsed 
time for the k value allowed by memory constraints. We may therefore try 
to use the same cluster for a different set of templates. This may become 
inefficient since loading a large data base (the new set of templates) may be 
a lengthy procedure. This cost may be reduced by using the same templates 
several times (corresponding to longer elapsed times) before loading a new set 
of templates. 

We disregard the overhead associated to the output of the computer correla- 
tions , that can be made very small taking into account the Gaussian character 
of the noise ( e.g. a 3o"-cut could reduce the number of the output correlations 
to the order of 1% ). More interestingly a cross correlation among closely 
spaced templates could be performed on line packing more density the avail- 
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able information. 

We would like to optimize among these conflicting requirements. Let us con- 
sider the total compute time both for different sets of templates (case 1) or 
the same set of templates (case 2) on each cluster element. We have 

• Case 1: Wc want to compute r sets of 8k correlations each on templates of 
length Up, corresponding to the same time interval. We compute correlations 
on I adjoining time intervals before switching to a new set of templates. The 
computation time can be modeled as 

8rknp/B + npl/B + lrTc{np,k) (16) 

where B is the cluster input-output bandwidth (measured in words per 
unit time). The first term in (16) is the time required to load the templates 
on all processors, the second term is the time needed to broadcast np signal 
points to all cluster elements while the third term refers to the actual com- 
putation, to be performed Ir times. Templates, correlations and input data 
must fit inside the memory, implying that {2k + l)np < Mt, where Mt is 
the available memory on each node ( measured in units of complex words 
). Also, the computation must complete in a time interval l{np — Ut)/ fs- 
In (16) we assume that all data-points arc loaded once. This reduces input- 
output time but reserves a large fraction of memory space to data-points (as 
opposed to templates). Alternatively (case lb), we may load a smaller set 
of data-points every time we start a new computation. The corresponding 
compute time becomes 



8rknp/B + nprl/B + lrTc{np,k) (17) 

while the memory constraint changes to {2k + l)np < Mt- For any phys- 
ical template of length n^, we must maximize 8rk in terms of r, k, I and 
np satisfying all constraints. 
• Case 2: The procedure discussed above can be applied also in this case. The 
corresponding processing time is given by 

rknp/B + 8npl/B + lrTc{np,k) (18) 

This equation differs from (16) since we now broadcast templates while we 
load different data-points to each processing elements. The memory con- 
straint is the same as in Case 1, while the maximum allowed processing 
time is 81 {np — nx)/ fs- Case 2b (multiple data loads) is also easily com- 
puted as 

rknp/B + 8nprl/B + lrT^{np,k) (19) 
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In case 2, we are interested in optimizing rk in terms of the same parameters 
as in the previous case. 

There is one free parameter in the optimization process (/). If we increase / 
we reduce the relative cost associated with template loading, but increase the 
latency associated to the computation. We arbitrarily decide to keep I small 
enough so the latency for any ut is not longer that a fixed amount of time 
Tw- We choose Tw as the time length of the longest template contained in 
the set. This choice may be useful also for data-organization purposes: every 
Tw time interval all correlations corresponding to templates of all lengths are 
made available. The result of the optimization process are given in Table 1 for 
APEmillc and Tabic 2 for apeNEXT. Results depend weakly on the allocation 
procedure discussed above, and are largely dominated by the sustained pro- 
cessing power. Bandwidth limitations are neatly dealt with: if we increase the 
available bandwidth by a factor four (e.g., using local disks) the number of 
templates handled by each cluster increases by less than 10%. With our choice 
of parameters Case lb is the preferred one for almost all template lengths. 





Case 1 


Case lb 


Case 2 


Case 2b 


28 


4824 


4955 


4937 


4854 


29 


4344 


4549 


4535 


4382 


2io 


3720 


4016 


4003 


3775 


2ii 


3177 


3474 


3458 


3252 


2i2 


2511 


2904 


2885 


2606 


2l3 


1792 


2226 


2197 


1893 


2i4 


1143 


1558 


1526 


1315 


2l5 


657 


1091 


1057 


865 


2l6 


349 


679 


644 


510 


2i7 


180 


378 


329 


274 


2l8 




191 


201 


135 


2l9 




86 







Table 1 

Number of templates handled by each APEmille processor cluster, as a function of 
the template length ut- Parameters are (see the text for definitions) fg = 2048Hz, 
B = 5 ■ lO^Wc/ sec, Tw = 1024sec. Numbers in bold flag the best case , while — 
mark cases where allocation can not be performed due to memory limits. 
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We remark that processing time has been directly measured on APEmille, 
while apeNEXT values arc extrapolations obtained by appropriately re-scaling 
the basic machine parameters, such as memory size, I/O bandwidth, and pro- 
cessor frequency. 





Cas(^ 1 


Case 11) 


Case 2 


Cas(^ 2b 


28 


15631 


15765 


15750 


15704 




14599 


14834 


14815 


14728 


210 


13626 


13852 


13832 


13752 


211 


12455 


12844 


12812 


12673 


212 


10969 


11597 


11546 


11321 


2l3 


9401 


9992 


9917 


9735 


2i4 


7782 


8614 


8528 


8264 


2l5 


5881 


6907 


6790 


6466 


216 


3979 


4991 


4852 


4540 


2i7 


2409 


3418 


3280 


3102 


2I8 


1282 


2205 


2107 


1944 


2l9 


673 


1259 


1189 


1087 


220 




641 


657 


546 


221 




285 







Table 2 

Same as 1, extrapolated to apeNEXT. We re-scale processing power per node by a 
factor 3 and bandwidth by a factor 4. Memory increases by a factor 4 and Tw = 
4096sec.. 



6 Actual estimates on processors numbers 

We present here an accurate calculation of the total computational cost, and 
thus of the total processors number, required in order to analyze systems of 
coalescing binaries whose masses are in a certain range. 

The main point of this calculation is the production, given a definite mass 
range, of a suitable set of templates covering the corresponding parameter 
space. We remark that for our purposes (i.e. an estimation of the computa- 
tional cost) we only need a realistic template distribution in the parameter 
space and not a precise covering procedure for placing all the templates. There- 
fore we will adopt a simplified placing algorithm based on a weighted random 
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generation method. It is also important to note that the number of templates 
is very sensible to the shape of the experimental noise spectrum. 

In our template distribution strategy we adopt the regular metric gij obtained 
using the variables defined in (8), (9). We envisage a square lattice on the 9i, 62 
parameter space whose links A^^ are set to A^^ = (mine ^/detg)^^^'^, where 
minimization is on the domain 6 corresponding to the physical parameters 
we are interested in. Under these assumptions the surface corresponding to a 
square lattice in the 9i, 62 space can be approximated with S{d) = ^/detg ■ 
(A^)^, where ^/detg is the value calculated in a point at the center of the 
square. Hence the minimal surface corresponding to a lattice tile is unit. Now 
wc observe that S is proportional to the number of templates needed in that 
square region of the parameter space, and that it is roughly equal to the 
number of templates when divided by AV of eq. (10): 



A^(0)per square = ^Jdetg{^) ■ {Mf / AV. (20) 

Finally we allocate to every square in the lattice a number of templates equal 
to the rounded value of the previous expression, placing the first one in its 
center and the others randomly distributed inside the same region. 

The noise spectral density Sn{f), an experimentally measured and fitted curve, 
has different behavior for each experiment. It imposes particular constraints 
on lower and upper frequency cutoffs and on sampling (or interpolation) fre- 
quency. In tab. 3 we list the fitting functions relative to the VIRGO and LIGO 
experiment and corresponding parameters that we use in our calculations. 



s„(f) = s,j-'^ + s,„f-' + (l + (f/h,„.,f) 



Experiment 


/seism 








fknee 


LIGO 4K 


40 


5.6 ■ 10^6 


3.9 ■ 10'^^ 


1.1 • 10^*^ 


83 


VIRGO 


4 


9. • 10^7 


4.5 • 10^3 


3.24 • 10^6 


500 



Table 3 

Spectral density noise parameters for LIGO and VIRGO from [11]. 

The noise curves for LIGO and VIRGO are quite different. While LIGO is very 
sensible in a narrow frequency interval VIRGO has a lower peak sensibility 
but is better in a wider range of frequencies. 

In tab. 3 f seism indicates the so called seismic frequency, i.e. the frequency be- 
low which seismic noise is expected to dominate over all other noise sources. 
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Slightly different definitions for f seism exists, see [8] and [11]. Integration below 
this limit does not contribute significantly to detectability but is quite expen- 
sive in terms of computing power since it increases the number of templates 
and the template length. 

In our calculations, as proposed in [10], in order to reduce computing require- 
ments we adopt a more restrictive frequency range. We fix a lower and an 
upper frequency bounds // and /„ such that the total SNR recovery is at least 
of 97% (we assume the lower SNR loss of 2% and the upper of 1%). We note 
that template lengths are very sensible to the lower frequency cut-off, as the 
duration time, which influences linearly the storage requirements and loga- 
rithmically the computation cost, scales as fi Our frequency bounds are 
reported in the first two column of tab.4. 

Another point concerns the Minimal Match. We set MM — 0.97, which cor- 
responds to an event rate loss of roughly 10% [8]. 

As discussed at the end of Section 2 the MM level sets not only the density 
of templates in the parameter space (by AV, see (10) ) but also the signal ( 
and templates ) sampling frequency. This frequency could be very height, so 
in some cases, memory requirements could be severe. 

An alternative analysis strategy consist in sampling the signal at a lower fre- 
quency and then obtaining correlations at half-time points by an interpolation. 
In fact this can be simply achieved performing further anti-Fourier transforms 
after introducing in the integrals a suitable phase displacement. The number 
of interpolations obviously multiplies the analysis time. 

In our estimate we fix the sampling frequency by = 2/^ (rounding its 
value to the greatest power of two), so we have fs = 1024 Hz for LIGO and 
fs = 2048 Hz for VIRGO. This means ( see last column in table4) that we 
have to compute correlations at intermediate times by interpolations once for 
VIRGO and once for LIGO experiments. 



Ex])ei'inieiit 


// (Hz) 


/„ 


SNR rec-()\'ery 


/„„ at = 0.97 


LIGO 4K 

VIRGO 


55 (2.0 %) 
26 (1.9 %) 


390 (1.0 %) 
900 (1.1 %) 


97.0% 
97.0 % 


1203 
3253 



Table 4 



Frequency cuts used in our analysis, depending on signal to noise recovery and on 
Minimal Match. 



First, we show in fig. 5 the total computational cost to compute the correlation 
for binary systems whose masses are in a range of rrimin to 10 solar masses. 
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as a function of rrimin, under the assumption of optimal padding. We use the 
parameters hsted in tables 3 and 4. The computational cost roughly follows a 
(fitted) power-law behavior, with exponent of the order of 2.4. This behavior 
can be easily guessed, taking advantage of the fact that the computational 
load of each template depends very weakly on its length, and that the detg{9) 
depends weakly on the 9 variables. Under these assumptions the computa- 
tional cost scales up to log-corrections as the area of the region in 6 space 
corresponding to a given interval of allowed star-masses. The latter can be 
easily shown by power counting to behave as n\lf^. 



05 




Fig. 5. Behavior of the total computation cost (in GFlops) in the infinite memory 
availability case versus the lower mass limit, where maximum total mass is 10. Mq. 
Here we use VIRGO and LIGO noise spectrum, freq. sampling at 2048 Hz and 1024 
Hz respectively and MM = 97%. 

The large difference in computational cost between the two experiment, clearly 
noticeable in fig.5, derives, although in a complex way, from the different noise 
spectra and from the correspondingly different frequency cuts. 

We now specialize the discussion to APE systems. We proceed estabhshing a 
mass interval, then generating its template distribution. We "stretch" template 
lengths to the nearest power of two larger than the actual length (a slightly 
pessimistic assumption). Finally we divide each group of templates of equal 
length by the corresponding number of templates handled by one processor 
cluster ( the bold numbers in tab.l and tab. 2), and sum all the resulting 
quotients. The final result represent the number of APE processor needed to 
satisfy the real time requirement on the given mass interval. 



The computational cost of this matching filter analysis is particularly sensible 
to the lower mass limit because of the increasing template length and of the 
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irregular behavior of the metric tensor gij in that region of the parameter 
space. For this reason it is useful to plot the number of processor versus the 
lower mass limit. The number of nodes (one cluster consist of 8 nodes) for 
a mass interval from rrimin to IOMq is plotted in fig. 6, where we use noise 
spectra relevant for LIGO and VIRGO. This complete our analysis. 
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Fig. 6. Number of needed APE nodes versus the lower mass limit, maximum total 
mass is 10. Mq. Using VIRGO and LIGO noise spectrum, freq. sampling of 2048 
Hz and 1024 Hz respectively, MM = 97% . 



7 Conclusions 



In this paper we have developed a reliable estimate of the computational costs 
for real-time matched filters for GW search from binary star systems, in a 
massively parallel processing environment. 

We have analyzed some criteria to optimally allocate the processing load to a 
farm of processors. We have written a code performing the analysis on an APE 
system and we have measured its performances. Our result is that available 
(APEmille) systems are able to satisfy the requirements of a real-time analysis 
of the complexity corresponding to the LIGO experiment in the mass range 
between 0.25 and 10 Mq. 

The VIRGO experiment ( with its lower and wider noise curve ) has substan- 
tially larger computing requirements that cannot be fulfilled by an APEmille 
system in the same mass range. The new APE generation, expected to be 
available in early 2004, partially closes this performance gap. 
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